Tesseract OCR Engine. What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007



Similar documents
An Overview of the Tesseract OCR Engine

Module 6 Other OCR engines: ABBYY, Tesseract

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Automatic License Plate Recognition using Python and OpenCV

AccuRead OCR. Administrator's Guide

AccuRead OCR. Administrator's Guide

REMOTE DEPOSIT CAPTURE MARKET

Adjunct Assistant Professor Gayle Rembold Furbert. Typography II. County College of Morris Graphic Design Degree Program

Electronic Document Management System (EDMS) Insert Subtitle Here

What is the Future for Mail Sorting?

Excel: Analyze PowerSchool Data

Document Image Processing - A Review

Optical Character Recognition (OCR)

EPSON PERFECTION SCANNING BASICS

Optical Character Recognition. Joerg Schulenburg, LinuxTag 2005 GOCR

Document Management with Workgroup Scanners. How to guide

Fall Lecture 1. Operating Systems: Configuration & Use CIS345. Introduction to Operating Systems. Mostafa Z. Ali. mzali@just.edu.

Matt Cabot Rory Taca QR CODES

4) How many peripheral devices can be connected to a single SCSI port? 4) A) 8 B) 32 C) 1 D) 100

The Re-emergence of Data Capture Technology

Chapter 3. Application Software. Chapter 3 Objectives. Application Software. Application Software. Application Software. What is application software?

Contents. BMC Atrium Core Compatibility Matrix

Kofax Solution Brief. Kofax Enterprise Capture Solutions Enable Document-driven Business Processes in SharePoint

CHAPTER 5: PRODUCTIVITY APPLICATIONS

Veco User Guides. Document Management

Proposal submitted by

QQConnect Overview Guide

Chapter 3. Application Software. Chapter 3 Objectives. Application Software

Adaptive Automated GUI Testing Producing Test Frameworks to Withstand Change

PROCESSING & MANAGEMENT OF INBOUND TRANSACTIONAL CONTENT

Web Applications: Overview and Architecture

Use this translator to save ArchiCAD layouts/views in DXF/DWG format if you plan to continue

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

ELECTRONIC DOCUMENT IMAGING

Kofax Transformation Modules Generic Versus Specific Online Learning

Discovering Computers Chapter 3 Application Software

The power of root on Android emulators

DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH

GOCR Optical Character Recognition. Joerg Schulenburg, LinuxTag 2005 GOCR

eggplant for Cross Platform Test Automation TestPlant Nick Saunders

COMPUTER - INPUT DEVICES

How much of my Mailbox size have I used?

3D Building Roof Extraction From LiDAR Data

Computing Concepts with Java Essentials

Multimedia Applications

User Guide Bet Mover

10.1 FUNCTIONS OF INPUT AND OUTPUT DEVICES

Implementing Optical Character Recognition on the Android Operating System for Business Cards

Document scanning. Done right. TM

TIRED OF TYPING? ACCELERATE YOUR BUSINESS WITH: AUTOMATED DATA ENTRY PURCHASE APPROVAL WORKFLOW DIGITAL DOCUMENT ARCHIVE DOCUMENT CAPTURE

HP Records Manager. Release Notes. Software Version: 8.1. Document Release Date: June 2014

Zeenov Agora High Level Architecture

Kofax Enterprise Brief. A Roadmap to Automating Document Driven Business Processes. Implementing a Real Enterprise Capture Strategy

Compare and Contrast OCR and Forms Recognition Technologies. Peter Lang and Scott Hamilton

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Form Design Guidelines Part of the Best-Practices Handbook. Form Design Guidelines

Connections to External File Sources

Introduction to Information & Communications Technology

SmartArrays and Java Frequently Asked Questions

ScreenMatch: Providing Context to Software Translators by Displaying Screenshots

Vision Impairment and Computing

TAKEAWAYS CHALLENGES. The Evolution of Capture for Unstructured Forms and Documents STRATEGIC WHITE PAPER

Decision Trees from large Databases: SLIQ

Office of History. Using Code ZH Document Management System

Karlen Communications

White Paper: Linking Images With Applications. The TWAIN Working Group May, 2011

How To Design A Logo For A Corporation

Inbound Routing with NetSatisFAXTion

Scan2CRM - SalesForce

WHITEPAPER. Text Analytics Beginner s Guide

STRETCH : A System for Document Storage and Retrieval by Content

Using Web Services for scanning on your network (Windows Vista SP2 or greater, Windows 7 and Windows 8)

RAPIDFILE PRODUCT BACKGROUND. Ashton-Tate's RapidFile is a high-performance file manager

Oracle Universal Content Management

Table of Contents. Introduction: 2. Settings: 6. Archive 9. Search Browse Schedule Archiving: 18

IMPROVING THE EFFICIENCY OF TESSERACT OCR ENGINE

Enhanced Imaging Options for Client Profiles for Windows

Using Google Drive. Using Google Drive. Information Security Requirements

Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study

Konica Minolta Unity Document Suite. Powerful integrated document processing. Document capture & distribution Unity Document Suite

Overview of MIS Professor Merrill Warkentin

Transcription:

Tesseract OCR Engine What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007

Contents Introduction & history of OCR Tesseract architecture & methods Announcing Tesseract 2.00 Training Tesseract Future enhancements

A Brief History of OCR What is Optical Character Recognition? OCR My invention relates to statistical machines of the type in which successive comparisons are made between a character and a charac-

A Brief History of OCR OCR predates electronic computers! US Patent 1915993, Filed Apr 27, 1931

A Brief History of OCR 1929 Digit recognition machine 1953 Alphanumeric recognition machine 1965 US Mail sorting 1965 British banking system 1976 Kurzweil reading machine 1985 Hardware-assisted PC software 1988 Software-only PC software 1994-2000 Industry consolidation

Tesseract Background Developed on HP-UX at HP between 1985 and 1994 to run in a desktop scanner. Came neck and neck with Caere and XIS in the 1995 UNLV test. (See http://www.isri.unlv.edu/downloads/at-1995.pdf ) Never used in an HP product. Open sourced in 2005. Now on: http://code.google.com/p/tesseract-ocr Highly portable.

Tesseract OCR Architecture Input: Gray or Color Image [+ Region Polygons] Adaptive Thresholding Binary Image Character Outlines Organized Into Words Find Text Lines and Words Recognize Word Pass 1 Character Outlines Connected Component Analysis Recognize Word Pass 2

Adaptive Thresholding is Essential Some examples of how difficult it can be to make a binary image Taken from the UNLV Magazine set. (http://www.isri.unlv.edu/isri/ocrtk )

Baselines are rarely perfectly straight Text Line Finding skew independent published at ICDAR 95 Montreal. (http://scholar.google.com/scholar?q=skew+detection+smith) Baselines are approximated by quadratic splines to account for skew and curl. Meanline, ascender and descender lines are a constant displacement from baseline. Critical value is the x-height.

Spaces between words are tricky too Italics, digits, punctuation all create special-case font-dependent spacing. Fully justified text in narrow columns can have vastly varying spacing on different lines.

Tesseract: Recognize Word Character Chopper Character Associator Done? No Yes Adapt to Word Static Character Classifier Dictionary Adaptive Character Classifier Number Parser

Outline Approximation Original Image Outlines of components Polygonal Approximation Polygonal approximation is a double-edged sword. Noise and some pertinent information are both lost.

Tesseract: Features and Matching Prototype Character to classify Extracted Features Match of Prototype To Features Static classifier uses outline fragments as features. Broken characters are easily recognizable by a small->large matching process in classifier. (This is slow.) Adaptive classifier uses the same technique! (Apart from normalization method.) Match of Features To Prototype

Announcing tesseract-2.00 Fully Unicode (UTF-8) capable Already trained for 6 Latin-based languages (Eng, Fra, Ita, Deu, Spa, Nld) Code and documented process to train at http://code.google.com/p/tesseract-ocr UNLV regression test framework Other minor fixes

Training Tesseract Tesseract Data Files Word List Training page images Tesseract +manual correction Tesseract Wordlist2dawg Character Features (*.tr files) mftraining cntraining User-words Word-dawg, Freq-dawg inttemp, pffmtable normproto Box files Unicharset_extractor unicharset Addition of character properties unicharset Manual Data Entry DangAmbigs

Tesseract Dictionaries Usually Empty Tesseract Data Files User-words Word List Infrequent Word List Frequent Word List Wordlist2dawg Word-dawg, Freq-dawg

Tesseract Shape Data Tesseract Data Files Prototype Shape Features Expected Feature Counts Training page images Tesseract +manual correction Tesseract Character Features (*.tr files) mftraining cntraining inttemp, pffmtable normproto Box files Character Normalization Features

Tesseract Character Data Tesseract Data Files Training page images Tesseract +manual correction List of Characters + ctype information Box files Unicharset_extractor unicharset Addition of character properties unicharset Typical OCR errors eg e<->c, rn<->m etc Manual Data Entry DangAmbigs

Accuracy Results Comparison of current results against 1995 UNLV results Testid Testset Character Non-stopword Errors Accuracy Change Errors Accuracy Change 1995 Bus.3B 5959 98.14% 1293 95.73% 1995 Doe3.3B 36349 97.52% 7042 94.87% 1995 Mag.3B 15043 97.74% 3379 94.99% 1995 News.3B 6432 98.69% 1502 96.94% Gcc4.1 Bus.3B 6258 98.04% 5.02% 1312 95.67% 1.47% Gcc4.1 Doe3.3B 28589 98.05% -21.35% 6692 95.12% -4.97% Gcc4.1 Mag.3B 14800 97.78% -1.62% 3123 95.37% -7.58% Gcc4.1 News.3B 7524 98.47% 16.98% 1220 97.51% -18.77% Gcc4.1 Total 57171-10.37% 12347-6.58%

Commercial OCR v Tesseract 100+ languages. Accuracy is good now. Sophisticated app with complex UI. Works on complex magazine pages. Windows Mostly. Costs $130-$500 6 languages + growing. Accuracy was good in 1995. No UI yet. Page layout analysis coming soon. Runs on Linux, Mac, Windows, more... Open source Free!

Tesseract Future Page layout analysis. More languages. Improve accuracy. Add a UI.

The End For more information see: http://code.google.com/p/tesseract-ocr