How to Keep OCR Errors from Spoiling Your ediscovery Party




How to Keep OCR Errors from Spoiling Your ediscovery Party
ACEDS Webinar, May 21, 2014
© 2002-2013 Nuance Communications, Inc. All rights reserved.

ACEDS Membership Benefits
Training, Resources and Networking for the E-Discovery Community
Exclusive News and Analysis
Weekly Web Seminars
Podcasts
On-Demand Training
Networking
Resources
Jobs Board & Career Center
bits + bytes Newsletter
CEDS Certification
And Much More!
"ACEDS provides an excellent, much needed forum to train, network and stay current on critical information." Kimarie Stratos, General Counsel, Memorial Health Systems, Ft. Lauderdale
Join Today! aceds.org/join or call ACEDS Member Services at 786-517-2701

Speaker Introduction
Greg Gies, Director, Product Marketing, Imaging, Nuance Communications
Leads go-to-market planning for Nuance's print, capture & PDF solutions.

Agenda
What OCR is.
What OCR errors are and what causes them.
Why ediscovery professionals should care.
How common these problems are.
What can be done to prevent OCR errors.
How to correct OCR errors when they happen.

What is OCR?
Wikipedia: "Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded / computer-readable text."
2009 Computerworld article: "Optical character recognition (OCR) is the translation of optically scanned bitmaps of printed or written text characters into character codes."

I say, OCR is the digital transcription of bitmaps containing machine-printed text into encoded text characters, using a coding scheme such as ASCII, which among other capabilities enables indexing software to decipher textual elements contained within bitmaps. Encoded text is searchable text.

What are OCR errors and what causes them?

Types of OCR errors
Transcription errors. Example: "learning" becomes "leaming". Result: misspelled words. Impact: unsearchable. Proportional fonts are especially problematic.
Formatting errors. Example: "learning" becomes "l e a r n i n g". Result: poor legibility. Impact: unsearchable.
Deleted metadata. Result: data loss. Impact: metadata is unsearchable.

Some image defects and causes

Faulty printing equipment
  Defects: toner specks, vertical lines, toner smear, gray background, page skew, light/dark print
  Causes: defective toner cartridge, worn rollers, wrong settings, worn pickup roller, low toner or ink cartridge, clogged nozzles

Paper or form elements, and faulty scanning equipment
  Defects: halftones, vertical/horizontal lines, noise, specks, page skew, light/dark image
  Causes (paper or form elements): colored paper, carbon copies, shaded and lined forms, low/high contrast background
  Causes (scanning equipment): dirty platen, feeder misadjusted, worn pickup roller, low resolution, misfeed

Examples of image defects
[Figure: sample images showing specks, fuzzy edges, color background, halftone, and gridlines]

Cause & effect
[Figure: a scanned page annotated with three defects: skewed image, dark text, toner specks]

OCR results...
;., 1HIt#~!':tl'Ol\<fLo\i;l;i\~: 'do ~""r.bf...,200., by- ~ ODNEY D~"'!\Ob$Y~s.IN(;.. ~~"""""" Iaws "'''''.S_",c.rm",,;, ~,-"'"'_U "s.,.",) JOE STANDUi' ~Irnown """,,"). _ and SclIa _<01"'"""y 'l>eknowhllereinasutlj.e paiti~s",'......... "

Formatting errors
[Figure: original image compared with its OCR conversion to .docx]

Why should ediscovery professionals care about OCR errors?

PDF Files Created by Scanners Aren't Necessarily Searchable
1. Unlike PDF files that were born digital, a PDF file from a scanner is an image of a paper document.
2. While the text in an image may look similar or identical to text in a born-digital PDF, it is invisible as far as search algorithms are concerned.
3. Images aren't searchable until processed with OCR.
4. A searchable PDF stores the OCR output alongside the page image, as a hidden text layer.

Digitized Versus Native Documents
Why special handling is required

Electronic
  Originated online
  Many native file formats
  Encoded text, machine-readable
  ediscovery software is designed for these documents

Digitized
  Scans of paper originals
  TIFF & PDF are the common formats
  Text is a bitmap, not encoded
  OCR makes scans machine-readable


All Searchable PDFs Aren't Equally Findable
OCR Word Accuracy Matters
Character vs. word accuracy: 1 character error = 1 word error
  1 character error per 10,000 characters = 0.0001
  1 word error per 1,000 words = 0.001
Seemingly small differences in OCR error rate lead to very large differences in errors
  98.25% vs. 96.05% accuracy over 948 pages: a delta of about 2%
  Over 12,000 more word errors
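
The arithmetic on this slide can be made concrete with a short Python sketch. It is an illustration only: the function names, the assumed average of 10 characters per word, the assumption that character errors are independent, and the figure of 600 words per page are all assumptions introduced here, not numbers from the webinar.

```python
def word_error_rate(char_error_rate: float, chars_per_word: float = 10.0) -> float:
    """Estimate the word error rate from the character error rate.

    Assumes character errors are independent and that one wrong
    character makes the whole word wrong (as the slide states).
    """
    return 1.0 - (1.0 - char_error_rate) ** chars_per_word


def extra_word_errors(total_words: int, accuracy_a: float, accuracy_b: float) -> float:
    """Additional word errors produced by the less accurate of two engines."""
    return abs(accuracy_a - accuracy_b) * total_words


if __name__ == "__main__":
    # One character error in 10,000 is roughly one word error in 1,000.
    print(f"{word_error_rate(0.0001):.4f}")

    # Hypothetical page density; the slide gives only the page count (948).
    words_per_page = 600
    total_words = 948 * words_per_page
    print(round(extra_word_errors(total_words, 0.9825, 0.9605)))
```

Under these assumptions, the 2.2-point gap in word accuracy over 948 pages works out to more than 12,000 additional word errors, which is in line with the figure on the slide. A different words-per-page assumption shifts the total proportionally.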

PDF Isn't Just PDF Anymore
Improper handling may lead to data loss
Electronic sticky notes & stamps obscure the text behind them
A PDF can contain native pages or elements
A PDF can contain digitized pages or elements
OCR software may flatten the image, i.e., convert PDF to TIF, then OCR

Four Important Ways PDF is Different
1. PDF files can be assembled from multiple files.
2. PDF elements can be rearranged. (Example: a user copied a snip of a scanned document.)
3. PDF files can have multiple layers.
4. PDF files can be born digital or created by a scanner.

Why These Differences Matter
When these properties converge with scanned text, ediscovery pitfalls arise that can lead to inadvertent data loss due to processing errors.

Single PDF Page Can Contain Both Born Digital and Scanned Content
This PDF contains 2 scanned pages.

Notes, Text Boxes, Callouts, Stamps, Etc. Can Be Overlaid On Images

How Data Gets Inadvertently Destroyed
OCR may flatten objects overlaid on text, making the text underneath unreadable and metadata unsearchable.

Why Is Data Destruction A Problem?
"Spoliation is the destruction or significant alteration of evidence, or the failure to preserve property for another's use as evidence in pending or reasonably foreseeable litigation."
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php

Penalties for Spoliation
$2,750,000: United States v. Philip Morris USA, Inc.
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php

What If The Data Was Destroyed Unintentionally? I'm Okay, Right?
"The intent to alter or destroy electronic data is not required for spoliation to occur."
Source: http://www.gwblawfirm.com/ap-spoliation-a-trap-for-the-unwary.php

Other Reasons Data Loss is Bad
Spoliation isn't the only concern
Destruction of evidence: affects case outcome.
Time wasted recreating lost data: reduced profitability.
Client relations: loss of credibility and future business.

How common are OCR errors?

Typical OCR Word Accuracy Rates

Firms With Initiatives To Convert Paper Documents And Process To Digital
[Chart: percentage of respondents answering Yes / No / Don't know]

What Percentage Of The Knowledge Workers Have Access To Scanners?
[Bar chart: percentage of workers with scanner access. Categories: Total, Education, Financial, Healthcare, Insurance, Legal. Data labels as extracted: 72.8%, 66.7%, 75.2%, 64.5%, 78.6%, 80.9%]

The Use Of Scanning Within Organizations
[Chart: percentage of respondents reporting that scanning is Increasing / Stay the same / Decreasing]

Why scanning is increasing
[Chart: percentage of respondents giving each reason]
More people have been given access to scanning
Mix of more documents being scanned and more people gaining access to scanning
Individuals are scanning more documents
N = 158 respondents who have seen an increase in scanning

What can be done to prevent OCR errors from occurring?

Image enhancement filters out defects
Despeckle
Smooth Characters
Halftone Removal
Remove Gridlines
Color Background Removal
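
As a rough illustration of what a despeckle filter does, the Python sketch below clears isolated black pixels from a 1-bit bitmap. It is a deliberately simple stand-in, not the algorithm used by any particular OCR product. The function name `despeckle` and the list-of-lists bitmap encoding (1 = black, 0 = white) are assumptions made for the example.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = black pixel, 0 = white pixel (assumed encoding)


def despeckle(bitmap: Bitmap) -> Bitmap:
    """Return a copy of the bitmap with isolated black pixels cleared.

    A black pixel counts as a speck when none of its eight neighbours
    is black. Real filters usually remove larger blobs as well.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]

    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            # Look at the 3x3 window around (y, x), clipped to the page edges.
            has_black_neighbour = any(
                bitmap[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                cleaned[y][x] = 0
    return cleaned


if __name__ == "__main__":
    page = [
        [1, 0, 0, 0, 0],  # lone pixel in the corner: a speck
        [0, 0, 0, 1, 1],  # 2x2 block: part of a character stroke
        [0, 0, 0, 1, 1],
    ]
    for row in despeckle(page):
        print("".join("#" if px else "." for px in row))
```

The neighbour test reads from the original bitmap and writes to a copy, so clearing one speck never changes the decision for another pixel.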

Selectively Processing PDF Files Isn't a Viable Strategy
When a client delivers a disk with millions of files, how do you know which PDFs are compound files that require special handling?
It's like looking for needles in a haystack.

Segregate All PDF Files Before Conversion
After collecting documents, but before processing, review and analysis:
Identify all PDF documents
Move them to a separate directory to be run through OCR
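
The two steps above could be scripted along the following lines. This is a minimal Python sketch using only the standard library. The names `is_pdf` and `segregate_pdfs` and the directory layout are hypothetical, and identifying a PDF by its `%PDF-` header within the first 1024 bytes is a heuristic chosen for the example, not a procedure described in the webinar.

```python
import shutil
from pathlib import Path
from typing import List

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def is_pdf(path: Path) -> bool:
    """Detect a PDF by its header rather than trusting the file extension."""
    try:
        with path.open("rb") as handle:
            # Some writers put a few stray bytes first, so scan a small window.
            return PDF_MAGIC in handle.read(1024)
    except OSError:
        return False


def segregate_pdfs(collection_dir: Path, ocr_queue_dir: Path) -> List[Path]:
    """Move every PDF under collection_dir into ocr_queue_dir.

    The relative folder structure is kept so that files with the same
    name do not overwrite each other. Returns the new locations.
    """
    # Snapshot the file list first so that moving files does not disturb the walk.
    candidates = [p for p in sorted(collection_dir.rglob("*")) if p.is_file()]
    moved = []
    for source in candidates:
        # Skip files already in the queue and anything that is not a PDF.
        if ocr_queue_dir in source.parents or not is_pdf(source):
            continue
        target = ocr_queue_dir / source.relative_to(collection_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved
```

Checking the header also catches PDFs that were renamed with another extension. Where the original collection must stay untouched, copying with `shutil.copy2` instead of moving may be the safer choice.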

Pre-process all PDF files with OCR
Text images are potentially hidden within files
Newer OCR tools will make image text searchable without disturbing the rest of the data within the document
Ensures all text within every PDF document can be searched by the ediscovery system
Should also convert to PDF/A at the same time so files are ready for court filing

How to correct OCR errors when they happen

Post-processing error-correction options
1. Proofreading: too time-intensive
2. Run an automated spell check
   Effective at identifying and correcting spelling errors
   Doesn't solve contextual errors
     Example: "How is you day?" is spelled correctly but is clearly wrong
     If searching for "your", this instance won't be found
     No commercial solutions today solve this problem
   It helps to understand this potential problem
   Can either manually review and correct, or adjust the search strategy to minimize the impact, e.g., fuzzy search techniques
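
To show how a fuzzy search can recover terms that an exact search misses, here is a small Python sketch built on the standard library's `difflib.SequenceMatcher`. The function name `fuzzy_hits` and the 0.75 similarity threshold are assumptions chosen for the example. Commercial review platforms implement fuzzy search in their own ways.

```python
import re
from difflib import SequenceMatcher
from typing import List


def fuzzy_hits(term: str, text: str, threshold: float = 0.75) -> List[str]:
    """Return the words in text whose similarity to term meets the threshold."""
    term = term.lower()
    hits = []
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in sequence.
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            hits.append(word)
    return hits


if __name__ == "__main__":
    # OCR transcription error: "rn" read as "m".
    print(fuzzy_hits("learning", "Machine leaming improves with practice"))
    # Contextual error that a spell checker would not flag.
    print(fuzzy_hits("your", "How is you day?"))
```

Fuzzy matching trades precision for recall. A lower threshold finds more damaged words but also returns more unrelated ones, so the threshold is worth tuning against a sample of the collection.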

Final Thought: Why You Should Care
If a document was important enough to print, then to scan, it is important!
General business / transaction documents: purchase orders, invoices, contracts, employee identification documents
Healthcare organizations / patient medical records: physician's notes, discharge summaries, test results, post-operative reports, etc.
Personal identification documents: driver's licenses, social security cards, professional certificates
Public institutions / police records: incident / accident reports, police logs, court records
Insurance & banking: claims documents, medical records, financial records
Schools: inoculation records, transcripts, applications

Q&A

Next Steps
A recording of today's webinar will be available shortly for you to review at your leisure.
Contact Nuance with any questions you may have: 781-565-5000 or imaging@nuance.com
For more information visit: http://www.nuance.com/for-business/by-industry/legal/legal-solution
Try our 30-day free trial of Power PDF at http://www.powerpdf.com

Thank you