Prediction of OCR Accuracy


Luis R. Blando,¹ Junichi Kanai, Thomas A. Nartker, and Juan Gonzalez

1 Introduction

The accuracy of all contemporary OCR technologies varies drastically as a function of input image quality [Rice 92, Rice 93, Chen 93, Rice 94]. Given high quality images, many devices consistently deliver output text in excess of 99% correct. For low quality images, even images which are easily read by a human, output accuracy is frequently below 90%. This extreme sensitivity to quality is well known in the document analysis field and is the subject of much current research.

In this ongoing project, we have been interested in developing measures of image quality. We are especially interested in learning to predict OCR accuracy using some combination of image quality measures, independent of the OCR devices themselves. Reliable algorithms for measuring print quality and predicting OCR accuracy would be valuable in several ways. First, in large scale document conversion operations, they could be used to automatically filter out pages that are more economically recovered via manual entry. Second, they might be employed iteratively as part of an adaptive image enhancement system. At the same time, studies into the nature of image quality can contribute to our overall understanding of the effect of noise on algorithms for classification.

In this paper, we propose a prediction technique based upon measuring features associated with degraded characters. In order to limit the scope of the research, the following assumptions are made:

- Pages are printed in black and white (no color).
- Page images have been segmented, and text regions have been correctly identified.
- The image-based prediction system extracts information from text regions only.

This prediction system simply classifies the input image as either good (i.e., high accuracy expected) or poor (i.e., low accuracy expected).
2 Method

Features associated with degraded character images that cause OCR errors are used to determine the quality of the input image and to predict accuracy. A small set of sample pages with low character accuracy was carefully inspected, and the following observations were made.

1. Blando is now a member of Harmonix Corp. in Woburn, MA.

Observation 1: Pages with characters whose strokes are thick tend to have many touching characters. Another by-product of these characters is that the loops in letters like "a" and "e" often get filled up completely, or present only a minimal white portion in the center, as shown in Figure 1. Thus, a metric that could capture the existence of these minimally open loops would detect problems related to touching characters.

Figure 1: Filled loops

Observation 2: Broken characters are usually fragmented into small pieces, and character fragments can have almost any shape, as shown in Figure 2. A metric that could weight the existence of these character fragments would detect problems related to broken characters.

Figure 2: Broken characters

Observation 3: Pages with inverse video (white letters on a black background) or artistic typesetting tend to produce more OCR errors.

Based on these observations, three simple classification rules were developed. The training data consisted of 12 clean pages and 12 degraded pages extracted from the DOE data sets. For 21 of these pages, the median accuracy was computed from the output of eight OCR systems [Rice 92]. For the remaining 3 pages, the median accuracy was computed from the output of six newer systems [Rice 94]. Since the highest character accuracy obtained from the degraded pages was 88.76%, "good" corresponds to an expected accuracy above 90%, and "poor" corresponds to an expected accuracy below 90%.

To detect minimally open loops (Observation 1), a White Speckle Factor (WSF) was defined. A white speckle is any 8-connected white component whose size is less than or equal to 3 pixels in height and width. The factor is defined as:

    WSF = (# of white bounding rectangles smaller than 3 by 3) / (# of white bounding rectangles)
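As an illustration, the WSF can be sketched in Python. This is not the authors' implementation: it assumes the binary page image is given as a 2-D list with 1 for white pixels and 0 for black, and finds 8-connected white components with a simple breadth-first search.

```python
# Illustrative sketch of the White Speckle Factor (WSF).
# Assumption: img is a 2-D list, 1 = white pixel, 0 = black pixel.
from collections import deque

def white_component_boxes(img):
    """Yield the (height, width) bounding rectangle of each 8-connected white component."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if img[r][c] == 1 and not seen[r][c]:
                # BFS over one component, tracking its bounding rectangle.
                seen[r][c] = True
                q = deque([(r, c)])
                rmin = rmax = r
                cmin = cmax = c
                while q:
                    y, x = q.popleft()
                    rmin, rmax = min(rmin, y), max(rmax, y)
                    cmin, cmax = min(cmin, x), max(cmax, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and img[ny][nx] == 1 and not seen[ny][nx]):
                                seen[ny][nx] = True
                                q.append((ny, nx))
                yield rmax - rmin + 1, cmax - cmin + 1

def white_speckle_factor(img):
    """WSF = (# white components no larger than 3x3) / (# white components)."""
    boxes = list(white_component_boxes(img))
    if not boxes:
        return 0.0
    speckles = sum(1 for h, w in boxes if h <= 3 and w <= 3)
    return speckles / len(boxes)
```

A page dominated by filled loops leaves many tiny isolated white components, so the ratio rises toward the 0.1 threshold used below.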

This metric weighs the amount of white speckle present. It is expected that image quality goes down as this ratio goes up. Because of the small quantity of training data, the threshold value for identifying poor quality pages was manually determined to be 0.1. Thus, the first rule is as follows:

Rule 1: if WSF > 0.1 then quality_is_poor

To measure the amount of broken characters in a given page (Observation 2), a Broken Character Factor was defined. In general, the sizes and shapes of character fragments vary widely. Thus, their bounding rectangles will have many different widths and heights. When a frequency distribution of the bounding rectangles is plotted as a 3-D histogram, such as Figure 3, the bounding rectangles of character fragments appear near the origin. This region was therefore defined as the broken character zone, as shown in Figure 4. It is important to note that the broken character zone is designed to collect all small black connected components. These small components are mostly character fragments, but they can also be dots, such as the dot of an "i" or a period, and other legal characters in small type sizes. Since these dots fall inside the broken character zone, a density-based measurement is sensitive to the distribution of characters in the page. Therefore, the Broken Character Factor (BCF) uses area coverage rather than density to make a decision. To measure the degree of coverage of this zone, it is divided into square cells, at a rate of one per square pixel. The bounding rectangles of black connected components are allocated to these cells according to their width and height.
After these cells are filled by the connected components, the Broken Character Factor is computed as:

    BCF = (# of cells occupied) / (# of cells)

Figure 3: Broken Character Zone and Other Character Zones (the broken chars zone and dots zone near the origin, with the i's bodies zone, x-height letters zone, t's zone, ascender/descender zone, f & l zone, and l's zone at larger widths and heights)
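The cell-coverage computation can be sketched as follows. This is an illustrative simplification, not the authors' code: the polygonal zone of Figure 4 is approximated here by an axis-aligned rectangle spanning 15%-60% of the average component height and 15%-75% of the average component width, with 1x1-pixel cells as in the paper.

```python
# Illustrative sketch of the Broken Character Factor (BCF).
# Assumption: the broken character zone is simplified to a rectangle
# relative to the average-size reference point (Figure 4 uses a polygon).

def broken_character_factor(boxes):
    """boxes: (height, width) bounding rectangles of black connected components."""
    if not boxes:
        return 0.0
    avg_h = sum(h for h, _ in boxes) / len(boxes)
    avg_w = sum(w for _, w in boxes) / len(boxes)
    # Assumed zone limits, scaled from the average bounding-rectangle size.
    h_lo, h_hi = int(0.15 * avg_h), int(0.60 * avg_h)
    w_lo, w_hi = int(0.15 * avg_w), int(0.75 * avg_w)
    n_cells = max(1, (h_hi - h_lo + 1) * (w_hi - w_lo + 1))
    # A cell is occupied when some component's (height, width) falls on it.
    occupied = {(h, w) for h, w in boxes if h_lo <= h <= h_hi and w_lo <= w <= w_hi}
    return len(occupied) / n_cells
```

With mostly intact characters the coverage stays near zero; only pages whose components include many differently sized fragments push it toward the 0.7 threshold of Rule 2 below.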

Figure 4: Definition of the Broken Character Zone (zone boundaries at 15%, 60%, and 75% of the reference point RefX/RefY, which is set by the average bounding-rectangle width and height)

To eliminate the effects of font sizes, the average height and the average width of bounding rectangles are used as a reference point, and the shape of the broken character zone is defined as shown in Figure 4. From observations on the training data, an area coverage of 70% or more is a very strong indicator of the presence of many broken characters on the page. Thus, the second rule is defined as:

Rule 2: if BCF > 0.7 then quality_is_poor

A third rule, which uses the number of white connected components and their sizes as features, was defined to detect inverse video regions (Observation 3). In an inverse video region, white connected components correspond to characters. If either the average height or the average width of the white connected components exceeds 30 pixels (approximately 7 points), they are highly likely to be characters. Moreover, the number of black connected components in the region should be small because of the background. Thus, the ratio of the number of black connected components to the number of white connected components also provides useful information. The third rule is defined as:

Rule 3: if max(avg. white width, avg. white height) > 30 pixels
        and (# black connected components) / (# white connected components) < 1.5
        then quality_is_poor

This prediction algorithm classifies a given page as poor if at least one of these three rules is activated. Otherwise, it is classified as good.
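The three-rule decision can be combined into a single predictor. A minimal sketch, assuming the page-level statistics (WSF, BCF, component counts, and average white-component size in pixels) have already been measured as described above:

```python
# Sketch of the three-rule good/poor decision from Section 2.

def predict_quality(wsf, bcf, n_black, n_white, avg_white_h, avg_white_w):
    """Return 'poor' if any of the three rules fires, else 'good'."""
    if wsf > 0.1:                       # Rule 1: filled loops / touching characters
        return 'poor'
    if bcf > 0.7:                       # Rule 2: broken characters
        return 'poor'
    if max(avg_white_w, avg_white_h) > 30 and n_black / n_white < 1.5:
        return 'poor'                   # Rule 3: inverse video region
    return 'good'

print(predict_quality(0.05, 0.2, 800, 900, 6, 4))   # -> good
print(predict_quality(0.15, 0.2, 800, 900, 6, 4))   # -> poor (Rule 1)
```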

3 Experimental Setup

Two sets of test data were utilized. The first set is a subset of the DOE samples. The complete set consists of 460 pages that were selected at random from a collection of approximately 2,500 scientific and technical documents (approximately 100,000 pages). Each page was digitized at 300 dpi using a Fujitsu M3096M+ scanner to produce a binary image [Rice 94]. Since 21 pages in this data set were used to train the image-based prediction system, the remaining 439 pages were used as the DOE test set.

The second set consists of 200 pages selected from the 100 magazines that had the largest paid circulation in the U.S. in 1992, as reported by Advertising Age magazine [AdAge 93]. For each magazine, two pages were selected at random. Each page was digitized at 300 dpi using a Fujitsu M3096G scanner. The binary images were generated by this gray scale scanner using a fixed threshold of 127 out of 255.

Thus, 21 pages were used as a training set, and two sets containing 439 pages and 200 pages, respectively, were used as the test data sets. The performance of this prediction system was measured using the median character accuracy of six systems¹ obtained in the Third Annual Test of OCR Accuracy [Rice 94]. Each character insertion, deletion, or substitution needed to correct the OCR-generated text was counted as an error. Each reject character was also counted as a substitution error in this calculation. Character accuracy is defined as:

    Character accuracy = (n - # errors) / n

where n is the total number of characters in the ground truth data.

4 Results and Analysis

Table 1 shows the confusion matrix for the DOE test set. The classifier misclassified 15 poor quality pages as good and 53 good quality pages as poor. Thus, its error rate was 15%. The 15 poor-to-good misclassifications were carefully examined. The page images did not present substantial degradation. All but four of these problematic pages contained numerical tables in them.
The OCR output for many of these tables was illegible and generally useless. Yet, from an image defects point of view, we could find no evidence justifying a poor label. Table 2 shows the confusion matrix without pages containing tables. Table 3 summarizes the relationship between the number of black connected components in a page and the classification result. The table shows that this system has problems classifying pages containing fewer than 200 connected components. Table 4 shows the confusion matrix for the magazine data. The error rate was 13.5%, and similar performance was observed.

1. These six systems, which will not be identified with the experimental results, are Caere OCR (Ver. 132), Calera WordScan (Ver. 4), EDT ImageReader (Ver. 2.0), ExperVision RTK (Ver. 3.0), Recognita Plus DTK (Ver. 2.00D12), and XIS OCR Engine (Ver. 10).
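The character accuracy measure defined in Section 3 amounts to an edit-distance computation against the ground truth. A minimal sketch (a standard Levenshtein dynamic program, not the evaluation tool used in [Rice 94]):

```python
# Character accuracy as defined in Section 3: insertions, deletions, and
# substitutions needed to correct the OCR output, counted via edit distance.
# Reject characters in the OCR output simply count as substitutions.

def character_accuracy(ground_truth, ocr_text):
    n, m = len(ground_truth), len(ocr_text)
    # Levenshtein distance by dynamic programming, one row at a time.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if ground_truth[i - 1] == ocr_text[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    errors = prev[m]
    return (n - errors) / n

print(character_accuracy("the quick fox", "the qu1ck fox"))  # one substitution
```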

Table 1: Confusion Matrix for Sample 2 (439 pages)

                    Recognized
    True ID      Good    Poor
    Good          354      48
    Poor           15      22

Table 2: Confusion Matrix with Tables as Rejects

                    Recognized
    True ID      Good    Poor    Pages with Tables
    Good          257      42                  103
    Poor            4      18                   15

Table 3: Confusions vs. Number of Connected Components (tables filtered)

                          Number of Connected Components
    IS-AS        [0-100]  (100-200]  (200-300]  (300-400]  (400-500]  (500+]
    Poor-Poor          6          1          4          0          0       7
    Poor-Good          2          2          0          0          0       0
    Good-Poor          6          1          2          2          1      30
    Good-Good         29         17          9          5          4     193

Table 4: Confusion Matrix for Magazines (200 pages)

                    Recognized
    True ID      Good    Poor
    Good          159      27
    Poor            1      13

5 The DOE Test Data as Training Data

Figure 5 shows a scatter diagram for feature vectors consisting of the White Speckle Factor and the Broken Character Factor obtained from Sample 2. The diagram shows three classes: good quality pages, poor quality due to broken characters, and poor quality due to touching characters. To study whether or not a statistical classification method would improve the accuracy, the nearest mean rule using Mahalanobis distance was developed. The squared distance from the mean vector of a class i to a vector x to be classified is defined as:

    D_i^2(x) = (x - mu_i)^t K_i^{-1} (x - mu_i)

where K_i is the covariance matrix of class i. The mean vector and covariance matrix of each class are as follows:

    Good:      mu_g = (0.0261, 0.2381),  K_g = [0.0031  0.0014; 0.0014  0.0275]
    Touching:  mu_t = (0.1185, 0.1553),  K_t = [0.0186  0.0030; 0.0030  0.0110]
    Broken:    mu_b = (0.1128, 0.8290),  K_b = [0.0130  0.0008; 0.0008  0.0053]

Figure 6 shows the decision boundaries.

Figure 5: Scatter diagram for the feature vectors (White Speckle Factor vs. Broken Character Factor) with the heuristic decision boundaries WSF = 0.1 and BCF = 0.7.

Figure 6: Decision boundaries for the nearest mean rule, over the same White Speckle Factor vs. Broken Character Factor axes. The x's show the locations of the mean vectors.

The two heuristic rules, Rules 1 and 2, were replaced by the nearest mean rules. Tables 5 and 6 show the confusion matrices for the DOE test data and the magazines, respectively, generated by the nearest mean classifier. Although some errors were corrected, some new errors were introduced by this classifier. Hence, no improvement was made. The results showed that better features had to be identified before applying statistical pattern recognition techniques.

Table 5: Confusion Matrix for Sample 2 Using the Nearest Mean Rule

                    Recognized
    True ID      Good    Poor
    Good          353      49
    Poor           12      25

Table 6: Confusion Matrix for Magazines Using the Nearest Mean Rule

                    Recognized
    True ID      Good    Poor
    Good          157      29
    Poor            1      13
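For reference, the nearest mean rule can be sketched with the class parameters reported in Section 5 for the feature vector x = (WSF, BCF); the 'touching' and 'broken' classes both correspond to a poor-quality prediction.

```python
# Sketch of the nearest mean rule with squared Mahalanobis distance,
# using the means and covariance matrices reported in Section 5.
import numpy as np

CLASSES = {
    "good":     (np.array([0.0261, 0.2381]),
                 np.array([[0.0031, 0.0014], [0.0014, 0.0275]])),
    "touching": (np.array([0.1185, 0.1553]),
                 np.array([[0.0186, 0.0030], [0.0030, 0.0110]])),
    "broken":   (np.array([0.1128, 0.8290]),
                 np.array([[0.0130, 0.0008], [0.0008, 0.0053]])),
}

def nearest_mean(x):
    """Assign x to the class i minimizing D_i^2(x) = (x - mu_i)^t K_i^{-1} (x - mu_i)."""
    x = np.asarray(x, dtype=float)

    def d2(mu, K):
        d = x - mu
        return float(d @ np.linalg.inv(K) @ d)

    return min(CLASSES, key=lambda c: d2(*CLASSES[c]))

print(nearest_mean([0.03, 0.25]))  # near the good-class mean -> good
```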

6 Summary

A method for predicting the accuracy of OCR-generated text has been presented. Our approach measures features associated with degraded text images and does not use any output from an OCR system. Experimental results support the feasibility of this approach. However, some OCR errors, such as those made in recognizing numerical tables, are not caused by image degradation and cannot be detected by this approach. To improve the performance of the prediction system, better features must be identified before applying statistical pattern recognition techniques.

Acknowledgment

The authors thank Professor Nagy (RPI) for helpful discussions.

Bibliography

[Blando 94] Blando, Luis R., Evaluation of Page Quality Using Simple Features, Master's Thesis, University of Nevada, Las Vegas, 1995.

[Bohner 77] Bohner, M., et al., An Automatic Measurement Device for the Evaluation of the Print Quality of Printed Characters, Pattern Recognition, Vol. 9, 1977, pp. 11-19.

[Bokser 92] Bokser, Mindy, Omnidocument Technologies, Proceedings of the IEEE, Vol. 80, No. 7, July 1992.

[Chen 94] Chen, Su; Subramaniam, Suresh; Haralick, Robert M.; and Phillips, Ihsin T., Performance Evaluation of Two OCR Systems, in Proc. of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April 11-13, 1994.

[Esakov 94] Esakov, Jeffery; Lopresti, Daniel P.; and Sandberg, Jonathan S., Classification and Distribution of Optical Character Recognition Errors, in Document Recognition, ed. L. M. Vincent and T. Pavlidis, SPIE Proceedings Series, Vol. 1281, 1994.

[Dickey 91] Dickey, Lois A., Operational Factors in the Creation of Large Full-Text Databases, DOE Infotech Conference, Oak Ridge, TN, May 1991.

[Duda 73] Duda, Richard O., and Hart, Peter E., Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
[Rice 92] Rice, Stephen V.; Kanai, Junichi; and Nartker, Thomas A., A Report on the Accuracy of OCR Devices, Technical Report TR-92-02, ISRI, University of Nevada, Las Vegas, 1992.

[Rice 93] Rice, Stephen V.; Kanai, Junichi; and Nartker, Thomas A., An Evaluation of OCR Accuracy, Technical Report TR-93-01, ISRI, University of Nevada, Las Vegas, 1993.

[Rice 94] Rice, Stephen V.; Kanai, Junichi; and Nartker, Thomas A., The Third Annual Test of OCR Accuracy, Technical Report TR-94-03, ISRI, University of Nevada, Las Vegas, 1994.
