GOCR Optical Character Recognition. Joerg Schulenburg, LinuxTag 2005 GOCR



Similar documents
Optical Character Recognition. Joerg Schulenburg, LinuxTag 2005 GOCR

Visual Structure Analysis of Flow Charts in Patent Images

OCR and 2D DataMatrix Specification for:

EPSON SCANNING TIPS AND TROUBLESHOOTING GUIDE Epson Perfection 3170 Scanner

Barcodes principle. Identification systems (IDFS) Department of Control and Telematics Faculty of Transportation Sciences, CTU in Prague

Basic Excel Handbook

Portal Connector Fields and Widgets Technical Documentation

Softek Software Ltd. Softek Barcode Reader Toolkit for Android. Product Documentation V7.5.1

Form Design Guidelines Part of the Best-Practices Handbook. Form Design Guidelines

EPSON PERFECTION SCANNING BASICS

Drawing a histogram using Excel

Data Tool Platform SQL Development Tools

Glass coloured glass may pick up on scan. Top right of screen tabs: these tabs will relocate lost windows.

BCC Multi Stripe Wipe

SHARP HEALTH PLAN Provider Notice CLAIMS SETTLEMENT PRACTICES, DISPUTE RESOLUTION MECHANISM & FEE SCHEDULE NOTICE

Automated Optical Inspection is one of many manufacturing test methods common in the assembly of printed circuit boards. This list includes:

Research on Chinese financial invoice recognition technology

Learn about OCR: Optical Character Recognition Track, Trace & Control Solutions

Customer Barcoding Technical Specifications

How to Prepare a Book for Press With Scribus

Introduction to Microsoft Publisher : Tools You May Need

Creating Forms with Acrobat 10

Galaxy Morphological Classification

Netigate User Guide. Setup Introduction Questions Text box Text area Radio buttons Radio buttons Weighted...

Microsoft Word 2011: Create a Table of Contents

Introduction to OCR in Revu

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

Cricut Design Space Reference Guide & Glossary

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

The Wondrous World of fmri statistics

Version of Barcode Toolbox adds support for Adobe Illustrator CS

Interactive Voting System. IVS-Basic IVS-Professional 4.4

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

Printing Packaging Mailing MAIL SERVICE Signage Corporate Identity

File Magic 5 Series. The power to share information PRODUCT OVERVIEW. Revised November 2004

Thinking Man s Trader

Monitoring System Status

Printing Guide. MapInfo Pro Version Contents:

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

Environmental Remote Sensing GEOG 2021

Oracle Fusion Middleware

PAPER CRAFT. Assembly Instructions. Assembly instructions: Fifteen A4-sized sheets. Paper craft: Sixteen A4-sized sheets with 174 parts in all

Achieving 5 Nines Business Process Reliability With Barcodes. Michael Salzman, VP Marketing (408) sales@inliteresearch.

Teradata SQL Assistant Version 13.0 (.Net) Enhancements and Differences. Mike Dempsey

The GeoMedia Fusion Validate Geometry command provides the GUI for detecting geometric anomalies on a single feature.

MapInfo Professional Version Printing Guide

Enhanced Bar Code Engine

How to build text and objects in the Titler

ANDROID APPS DEVELOPMENT FOR MOBILE AND TABLET DEVICE (LEVEL I)

ClarisWorks 5.0. Graphics

Graphic Communication Desktop Publishing

Using Lexical Similarity in Handwritten Word Recognition

10.1 FUNCTIONS OF INPUT AND OUTPUT DEVICES

Introduction to Pattern Recognition

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

User Guide. Printing Unicode characters from SAP to SATO GT4xxe Printers. Version

GS1-128 Label Specifications. Version 1.0

Using Microsoft Word. Working With Objects

Matt Cabot Rory Taca QR CODES

Introduction to Dashboards in Excel Craig W. Abbey Director of Institutional Analysis Academic Planning and Budget University at Buffalo

Snap 9 Professional s Scanning Module

Done. Click Done to close the Capture Preview window.

INVENTORY MANAGEMENT. TechStorm.

Package tagcloud. R topics documented: July 3, 2015

Blender Notes. Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine

Face detection is a process of localizing and extracting the face region from the

Energy Management System CANBUS Interface Specification

Scanners and How to Use Them

An Implementation of a High Capacity 2D Barcode

Contents TeleForm Bookings... 5 TeleForm System Overview... 7 TeleForm Designer... 9 Create a Template Save a Template Adding pages for

Visualization of 2D Domains

Table of Contents. I. Banner Design Studio Overview II. Banner Creation Methods III. User Interface... 8

Roof Tutorial. Chapter 3:

MassArt Studio Foundation: Visual Language Digital Media Cookbook, Fall 2013

6. Optical Character Recognition (OCR) Technology

Bar Codes. A Primer for Document Management

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Character Creation You can customize a character s look using Mixamo Fuse:

Response Services User Guide. How to design your Response Service items 4A_ _1 1

PageX: An Integrated Document Processing and Management Software for Digital Libraries

Help on Icons and Drop-down Options in Document Editor

Getting Started with Ascent Xtrata 1.7

Univariate Regression

Topological Data Analysis Applications to Computer Vision

WHAT You SHOULD KNOW ABOUT SCANNING

Tips for optimizing your publications for commercial printing

LIST OF CONTENTS CHAPTER CONTENT PAGE DECLARATION DEDICATION ACKNOWLEDGEMENTS ABSTRACT ABSTRAK

Klaus Goelker. GIMP 2.8 for Photographers. Image Editing with Open Source Software. rocky

NAVIGATION TIPS. Special Tabs

Planning and Managing Projects with Microsoft Project Professional 2013

Instant SQL Programming

Automatic License Plate Recognition using Python and OpenCV

Hermes.Net IVR Designer Page 2 36

ACHieve Access 4.3 User Guide for Corporate Customers

Gestation Period as a function of Lifespan

Transcription:

GOCR Optical Character Recognition

GOCR, how to use it? Ask your questions everytime! how does it work (short) examples + tips and tricks tuning and coding questions + diskussion

preprocessing How does it work? threshold value detection box detection, zoning, line detection sorting and melting, dust, pictures,... call ocr engine (3 engines, 2 experimental) repeated for unknown chars postprocessing (XML, TeX, UTF, ASCII)

running GOCR gocr h gocr sample.jpg gocr m 130 sample.jpg # short man page # best case scenario # database gocr v 39 m 58 e sample.jpg # debugging + out30.png

Looking inside... assume no colors, black on white only assume no rotation, same font, all characters are s e p a r a t e d try to repair if assumptions are hurt (can fail) every char is recognized empirically based on its pixel pattern lot of possibilities to improve GOCR

Looking inside... pgm2asc (pgm2asc.c) otsu + thresholding (otsu.c), load_db (database.c), scan_boxes remove_dust (remove.c), remove_borders, detect_barcode detect_pictures, detect_rotation_angle, detect_text_lines,... char_recognition (adjust_text_lines, char_recognition) compare_unknown_with_known_chars try_to_divide_boxes,list_insert_spaces, context_correction

A sample every char is recognized empirically based on its pixel pattern not flat no gap assumed A s _ M P L f length not compared spacing failed, (upper)case failed, unknown char, wrong char

verbose GOCR ( v 39) status: # Optical Character Recognition --- gocr 0.40 # options are: -l 0 -s 0 -v 39 -c f -m 0 -d -1 -n 0 vortrag/sample.png # using unicode reading file: # popen( pngtopnm vortrag/sample.png ) # readpam: format=0x5036=p6 h*w(d*b)=45*189(3*1) # readpam: min=0 max=255 reading database: # db_path= (null) threshold value detection: # OTSU: thresholdvalue = 81 gmin=0 gmax=255 # thresholding new_threshold= 160 box detection: # scanning boxes nc= 9 avd= 15 20

verbose GOCR ( v 39) dust detection and smoothing: # auto dust size = 0 nc= 9 avd= 15 20 #... 0 white pixels removed, cs=160 nc= 9 # smooth big chars 7x16 cs=160... 0 changes in 7 of 9 special objects: # detect barcode, 0 bars, boxes-0=9 # detect pictures, frames, noalphas, mxmy= 15 20... 0 - boxes 9 # averages: mxmy= 15 20 nc= 9 n= 9 clean the border: # remove boxes on border pictures= 0 rest= 9 numc= 9 #... deleted= 0, within pictures pictures= 0 rest= 9 numc= 9 #... deleted= 0, pictures= 0 rest= 9 numc= 9 some alpha code (new rotation angle detection): # rotation angle (x,y,num) (23990,731,7) (0,0,0), pass 1 # rotation angle (x,y,num) (23990,731,7) (68352,256,4), pass 2

verbose GOCR ( v 39) line detection: # detect longest line - at y=0 crosses= 0 my=0 - at crosses= 0 dy=0 # scanning lines #... trouble on line 1, wt/%= 56 # bounds: m1= 7 m2= 2 m3= 22 m4= 24 my= 20 # counts: i1= 4 i2= 3 i3= 2 i4= 5 # all boxes of same high! num_lines= 1 # add line infos to boxes... done # mark lines box corrections: # divide vertical glued boxes, numc 9 # searching melted serifs... 0 cluster corrected, 0 new boxes... # glue broken chars nc= 9 #... 0 fragments checked, 0 fragments glued #... 2 holes glued (same=0), 0 rest glued, nc= 7 # detect dust2,... 0 + 0 boxes deleted, numc= 7

verbose GOCR ( v 39) pitch detection: # check for word pitch #...WARNING measure_pitch: only 6 spaces # overall space width is 2 proportional char recognition: # step 1: char recognition unknown= 7 picts= 0 boxes= 7 # 3 of 7 chars unidentified # adjust lines diff= 1 # step 1: char recognition unknown= 1 picts= 0 boxes= 7 # 2 of 7 chars unidentified # debug: unknown= 1 picts= 0 boxes= 7 # mark lines pnmtopng: 13 colors found # step 2: try to compare unknown with known chars - found 0

verbose GOCR ( v 39) # step 3: try to divide unknown chars # divide box: 73 8 18 26 # list pattern x= 73 8 d= 18 26 t= 1 2...@@@@@@@@@@@@... -..@@@@@@@@.@@@@....@@@...@@@@...@@@...@@@@..@@@...@@@..@@@...@@@@..@@@...@@@@@@@@@@..@@@..@@@@@@@@@@@..@@@...@@@..@@@...@@@@ @@@@...@@@ @@@...@@@.@@... - x123= 5 10 8 m123= 258 142 141 x c12 = 5 ( (?). ( 98) ( 0) x c12 =10 (?) (?). ( 0) ( 0) x c12 = 8 (?) (?). ( 0) ( 0), numc 7

verbose GOCR ( v 39) # list shape 6 x= 160 6 d= 16 27 h=0 o=1 dots=0 0066 f # list box dots=0 boxes=1... c=f mod=(0x00) line=1 m=3 4 24 29 # list box x= 160 6 d= 16 27 r= 10 0 nrun=0 # list box frames= 1 (sumvects=44) # frame 0 44 vectors= 10 0 9 1 6 1 5 2 2 2 0 4 0 5... # list box char: f(100) # list pattern x= 160 6 d= 16 27 t= 1 2...$@@$@@@@$....@@@@@@@@@..@@@@@@@@$@@$....@@@@@@@@@@@@...< $@@@@@... @@@@@@.....$@@$.....@@@@......$@@......@@@......@@$......@@@.....$@@@@$@@$.....@@@@@@@@@.....$@@@@$@@@$.....@@@@@@@@@@......@@$......@@@......@@@......@@@......@@@......@@@......$@@@...$@@$....@@@@...@@@@..$@@@@@@@@@@@@$$.@@@@@@@@@@@@@@@

verbose GOCR ( v 39) spacing: # insert space between words (dy=27)... found 6 context correction: # step 4: context correction Il1 0O # store boxtree to lines... get_least_line_indent: page_width 189, dy 0 Line 1, y 8, raw indent 11, adjusted indent 11 Minimum indent is 11... 2 lines, boxes= 6, chars= 6 # debug: (_)= 1 picts= 0 chars= 6 (A)=1 # mark lines pnmtopng: 12 colors found A s _ M P L f Elapsed time: 0:00:79.418.

Tuning gocr -C A-Z sample.png # use optional filtering output: A M P L _ (E is unexpected) gocr -m 130 -C A-Z sample.png # use database input: S A M (94%) L (74%) E output: A S A M P L E gocr -s 13 -m 130 -C A-Z sample.png # pitch output: A SAMPLE

Debugging Normally its time to contact the author sending him the sample and an explanation that 'E' is failing Also possible: Find out, why 'E' is not recognized. setting C_ASK to 'E' in ocr0.c and recompile rerun ( m 56) now for every char ocr0.c explains, why its not a 'E' output the break line in ocr0.c and the pattern output: DBG L592 (160,6): break

Debugging output: DBG L592 (160,6): break 591: for( x=dx,y=y0+dy/6; y<y1-dy/9; y++ ) // left border straight 592: { i=loop(box1->p,x0,y,dx,cs,0,ri); if( i>x+dx/9 ) Break; 593: if(i<x) x=i; 594: } if( y<y1-dy/9 ) Break; // t Bug found! Break > break; but L594 also printed out, code defines max. jump onleft site of dx/9 (dx=16 here) its to restrictive for handwritten E's (we have max jump = 3) changing it to dx/5 could cause misdetections for other chars

Debugging Why 'f' is recognized instead of 'E'? Searching for 'f' in ocr0.c add some code to distinguish between E and f leave the 'f' subfuncion with a Break or reduce probability for the 'f' check, that 'f' of all other fonts are still recognized

Testing new code minimum test base is./examples/*.png results should never be worse than before patch code bigger public database is needed some automatition is desirable telling the programmer what has changed in the output