Scanning for OCR Text Conversion

Similar documents
EPSON SCANNING TIPS AND TROUBLESHOOTING GUIDE Epson Perfection 3170 Scanner

ADOBE ACROBAT X PRO SCAN AND OPTICAL CHARACTER RECOGNITION (OCR)

EPSON PERFECTION SCANNING BASICS

Reflection and Refraction

Chapter 5 Objectives. Chapter 5 Input

How to Keep OCR Errors from Spoiling Your ediscovery Party

How To Use A Court Record Electronically In Idaho

How To Use An Epson Scanner On A Pc Or Mac Or Macbook

AccuRead OCR. Administrator's Guide

OneTouch 4.0 with OmniPage OCR Features. Mini Guide

Film Scanner The term Film Scanner can refer to a dedicated slide and negative film scanner or to a capture type scanner.

AccuRead OCR. Administrator's Guide

A New Imaging System with a Stand-type Image Scanner Blinkscan BS20

A. Scan to PDF Instructions

REMOTE DEPOSIT CAPTURE MARKET

Going Paperless The Utah Experience. Mike Pecorelli Project Manager Utah DEQ

Scanners and How to Use Them

Quick Reference. Store this guide next to the machine for future reference. ENG

Enjoy easy, fast and versatile scanning

Perfection V800 Photo/V850 Pro User's Guide

Form Design Guidelines Part of the Best-Practices Handbook. Form Design Guidelines

Alternative Methods Of Input. Kafui A. Prebbie 82

Introduction to OCR in Revu

CALCULATING THE SIZE OF AN ATOM

RIEGL VZ-400 NEW. Laser Scanners. Latest News March 2009

Document scanning. Done right. TM

Scan to PC. Create a scan profile Custom Scan to PC settings Make copies. Send faxes

OCR and 2D DataMatrix Specification for:

WHAT You SHOULD KNOW ABOUT SCANNING

Digitization of Old Maps Using Deskan Express 5.0

Optical Character Recognition (OCR)

Scanning Scanning images. cover

Search Requests (Overview)

Graphic Design. Background: The part of an artwork that appears to be farthest from the viewer, or in the distance of the scene.

The Theory and Practice of Using a Sine Bar, version 2

Guidelines for the submission of invoices

CONTINIA DOCUMENT CAPTURE FACTSHEET ENGLISH LANGUAGE

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK

Get the Best Digital Images Possible. What s it all about anyway?

SUBMITTING A PRESS-READY COVER For Paperback Books with Perfect Binding, Plastic Comb, and Plastic Coil Binding

ABBYY FineReader 12 Corporate

Using Web Services for scanning on your network (Windows Vista SP2 or greater, Windows 7 and Windows 8)

Advanced Scanning Techniques

Snap 9 Professional s Scanning Module

TENDER DOCUMENT FOR DIGITIZATION OF PUBLICATIONS OF BIS

Color quality guide. Quality menu. Color quality guide. Page 1 of 6

File Formats for Electronic Document Review Why PDF Trumps TIFF

Samsung Rendering Engine for Clean Pages (ReCP) Printer technology that delivers professional-quality prints for businesses

In addition, a decision should be made about the date range of the documents to be scanned. There are a number of options:

LIBRARY OF CONGRESS NATIONAL DIGITAL LIBRARY PROGRAM and the CONSERVATION DIVISION

Graphic Communication

Adobe Acrobat 6.0 Professional

ABBYY Version 12 User s Guide. FineReader ABBYY Production LLC. All rights reserved.

Pictorial User s Guide

Data rescue and digitization: tips and tricks resulting from the Dutch experience

How To Scan A Document

Lenses and Telescopes

Selecting the Correct Automatic Identification & Data Collection Technologies for your Retail Distribution Center Application

Produce Traceability Initiative Best Practices for Direct Print

Elaboration of Scrum Burndown Charts.

Translation and Translation Verification

Prof. Dr. M. H. Assal

WHITE PAPER DECEMBER 2010 CREATING QUALITY BAR CODES FOR YOUR MOBILE APPLICATION

Best practices for producing high quality PDF files

PaperPort Getting Started Guide

The Mathematics 11 Competency Test Percent Increase or Decrease

Map Patterns and Finding the Strike and Dip from a Mapped Outcrop of a Planar Surface

How to Prepare a Book for Press With Scribus

Adobe Acrobat 9 Pro Accessibility Guide: PDF Accessibility Overview

Navigation Aid And Label Reading With Voice Communication For Visually Impaired People

10.1 FUNCTIONS OF INPUT AND OUTPUT DEVICES

4.4 Refractive Index of Novel Carbazole Based Methacrylates, Acrylates, and Dimethacrylates

Variables Control Charts

Input and Output. Chapter 6

HP Scanjet G4000 series. User Guide

Artwork - What Do I Need To Know Before I Start Printing?

DALLAS COUNTY DISTRICT CLERK NEW STATEWIDE RULES FOR E FILING

Digital Messages GUIDELINES

Application Example: Quality Control. Turbines: 3D Measurement of Water Turbines

File by OCR Manual. Updated December 9, 2008

The Answer to the 14 Most Frequently Asked Modbus Questions

Technical Tip Image Resolutions for Digital Cameras, Scanners, and Printing

Kurzweil 3000 version General User Guide

Essentials for Ratio Analysis

Going Paperless. There are many implications

Confidence Intervals for Cpk

Consistent Assay Performance Across Universal Arrays and Scanners

Connections to External File Sources

Corporate Records Scanning Strategy

What Resolution Should Your Images Be?

Equations, Lenses and Fractions

MACHINE VISION FOR SMARTPHONES. Essential machine vision camera requirements to fulfill the needs of our society

Transcription:

Scanning for OCR Text Conversion Background Optical Character Recognition (OCR) is the recognition and interpretation of text in a digital image. Its main application is the translation of printed type into an editable and searchable document. Modern OCR software has become very accurate, however; in many ways it is still largely dependent on the quality of the scanned image. In particular, when dealing with bound print sources such as novels and thick text-books that, due to physical limitations, are not easily read by optical scanning devices, it is highly desirable to have a device that can produce high quality images, capture all the information and give consistent results for every page. Aim To measure the accuracy of OCR when a thick-spine book is scanned with a conventional flatbed face-down scanner versus BookDrive DIY and to show how the image quality and ultimately the scanning method is related to this. Method We compared the accuracy of OCR text conversion on pre-processed images from a conventional flatbed face-down scanner (typical in most office environments) and BookDrive DIY. The sample book (shown to the left) is a hard-cover 1421 page text-book in good condition. The font type and size is quite standard, the page background is bright and white and there are no images. We scanned four pages from the center of the book (pages 700, 701, 702 and 703) at 200 and 300 dpi resolutions to produce color JPEG images without further manipulation or processing. For the sake of simplicity we will give three different performance measures rather than attempting to apply a single, all-encompassing formula.

The first and most important measure is our OCR Accuracy measure. Accuracy = 100 (missed characters + uncertain characters) * 100 / actual total characters This measures the percentage of printed type that is recognized and confidently interpreted. The second measure is our Image Quality measure. Image Quality = interpreted characters * 100 / actual total characters This measures the percentage of printed type that failed to be recognized as text at all, regardless of whether characters were interpreted correctly or not. The third measure is our OCR Confidence measure. Confidence = 100 uncertain characters * 100 / interpreted characters This measures the percentage of interpreted characters that are questionable or cannot be interpreted with any real level of confidence. Image resolution may be an important factor here. Observations Before we look at the figures it is worth mentioning the difficulty we experienced in scanning this type of thick-spine book with a flatbed face-down scanner. In particular, a lot of handling was involved, not only in flipping the book over to turn the page but in repositioning the book before each scan so that each page lay within the boundary of the A4 scanning region. Furthermore, considerable force had to be applied to flatten the pages near the page binding in order to capture as much of the text as possible. In fact some pages had to be scanned two or three times to obtain an acceptable image but even then results were less than satisfactory. In terms of ease-of-use and speed or throughput, flatbed face-down scanners are clearly more cumbersome and much slower than BookDrive DIY and they rely heavily on the ability of the user to correctly and consistently position the book each time before scanning. An image sample from the flatbed scanner The image above shows page 700 of the text-book when scanned with the flatbed scanner. The image has not been processed or enhanced in any way.

The OCR text recognition area The image above shows the boundary of the recognizable text area as identified by our OCR software. The area close to the page binding is dark and fuzzy and the text is unreadable. An image sample from BookDrive DIY The image above shows page 700 of the text-book when scanned with BookDrive DIY. The image has not been processed or enhanced in any way. The OCR text recognition area The image above shows the boundary of the recognizable text area as identified by our OCR software. The boundary encloses all of the text on the page.

Results OCR accuracy on test sample from flatbed face-down scanner Sample Page Image OCR Accuracy Page Total Resolution (dpi) Interpreted Uncertain 700 3260 200 2858 38 700 3260 300 3060 81 701 3312 200 2811 68 701 3312 300 2867 70 702 3260 200 2953 60 702 3260 300 2937 71 703 2895 200 2561 90 703 2895 300 2580 46 OCR accuracy on test sample from BookDrive DIY Sample Page Image OCR Accuracy Page Total Resolution (dpi) Interpreted Uncertain 700 3260 200 3260 2 700 3260 300 3260 2 701 3312 200 3311 9 701 3312 300 3312 5 702 3260 200 3259 5 702 3260 300 3260 1 703 2895 200 2894 5 703 2895 300 2895 4 The above figures show the ratio of interpreted characters to uncertain characters for each page when OCR was performed on the test samples. For samples from the flatbed scanner, the total number of interpreted characters was much less than the actual total number of characters on the page and the number of uncertain characters was very high. The results for flatbed scanners are inconsistent and there appears to be no direct correlation between image resolution and OCR accuracy. For some pages more characters were interpreted correctly with less uncertain characters using the higher resolution images while for others the opposite was true. For samples from BookDrive DIY, there was only a small deviation in the total number of interpreted characters for the 200 and 300 dpi images and the results were more consistent. In general there were slightly less uncertain characters with the higher resolution (300 dpi) images and the number of interpreted characters was exactly equal to the actual number of characters on the page.

OCR Performance Measures for F latbed Scanner vs BookDrive DIY 100 90 87 97 85 89 97 87 1000 99 99 99 99 100 80 70 60 50 Flatbed (200 dpi) Flatbed (300 dpi) BookDrive DIY (200 dpi) BookDrive DIY (300 dpi) 'Image Quality' 'Confidence' 'Accuracy' The chart above shows OCR performance measures based on images from the flatbed scanner versus BookDrive DIY. Note that figures represent percentage values rounded to 1 decimal place and floored to the nearest whole number. Conclusion Becausee the flatbed scanner produced poor images characterized by a dark fuzzy region near the page binding, OCR Accuracy was very low and a high percentage of printed type (in this region) was missed entirely. This is reflected in the low Image Quality rating. At a higher resolution (300dpi) some of this type was more recognizable but most of it was interpreted incorrectly and in real terms there was no overall improvement in Confidence. BookDrive DIY produced high quality images without dark shady regions and OCR Accuracy was extremely highh on these samples. Increasing the image resolution had no real impact on either Confidence or Accuracy, however; the 300dpi samples seemed to produce slightly better figures.

The chart above shows OCR Accuracy based on images from the flatbed scanner versus BookDrive DIY. Note that figures represent percentage values rounded to 1 decimal place and floored to the nearest whole number. In summary, BookDrive DIY produced better images than the flatbed scanner with clear text and an undisputable text-area boundary. BookDrive DIY employs a v-shaped cradle and transparent platen to gently sandwich the book and ensure pages stay flat while scanning. Digital cameras are mounted either side of the cradle so that their focal planes are parallel to the pages of the book. This gives the best perspective for capturing the entire page, without having to lay the book out flat, while the lighting environment ensures that no dark fuzzy regions show up near the page binding. Discussion While conducting this test we made every attempt to ensure scanning was carried out in a consistent manner and numbers were reported correctly, however; more accurate testing (including multiple scans of every page), a larger sample base to include other print sources and more thorough checking of OCR output, among other improvements would have undoubtedly produced more precise figures. Having said that, it is not our purpose to perform benchmark testing or highlight the capabilities and limitations of any particular OCR software. We used Abbyy FineReader 8.0 Professional Edition for OCR text conversion, however; we do not endorse this product and any commercially available software would have served us just as well. Similarly, we used a Lexmark X5470 scanner though any flatbed face-down scanner could have been used for the purpose of this test.