Module 6 Other OCR engines: ABBYY, Tesseract



Similar documents
OCR for historical printings a tutorial

WHITE PAPER. 3-Heights Scan to PDF Server Basics and Applications

System Administrator s Guide. ABBYY FineReader 8.0. Corporate Edition

Manual CisWeb application A gwt-based web service that provides access to language processing tools of CIS

L A TEX in a Nutshell

Connections to External File Sources

Compare and Contrast OCR and Forms Recognition Technologies. Peter Lang and Scott Hamilton

Big Data Processing with MapReduce for E-Book

Tesseract OCR Engine. What it is, where it came from, where it is going. Ray Smith, Google Inc OSCON 2007

EBSCO MEDIA FILE TRANSFER SOFTWARE INSTALLATION INSTRUCTIONS

How To Use Titanium Studio

EPSON PERFECTION SCANNING BASICS

CONTINIA DOCUMENT CAPTURE FACTSHEET ENGLISH LANGUAGE

Preserving the Spirit of the Epoch: Digital Conversion of Nordic Music Magazines Amalie Ørum Hansen Development Consultant Gentofte Centralbibliotek

Adobe InDesign Server CS2


DATASHEET Scanned images saved as TIFF or image PDF. s with TIFF or image-based PDF attachments

Scan Process Distribute. Secure Scanning integrated in your Office Printing Infrastructure

DB2 Web Query Interfaces

Grandstream Networks, Inc. XML Configuration File Generator User Guide

Remote Access to Unix Machines

AUTOMATED CONFERENCE CD-ROM BUILDER AN OPEN SOURCE APPROACH Stefan Karastanev

ANALYSIS OF GRID COMPUTING AS IT APPLIES TO HIGH VOLUME DOCUMENT PROCESSING AND OCR

AccuRead OCR. Administrator's Guide

Records Manager 8.1. What s New

Facts about ecodms. Facts about ecodms applord gmbh Vinzenz Hopfgartner 8. July 2010

Tested configuration for Major versions of Primavera:-

Homework 9 Android App for Weather Forecast

Office of History. Using Code ZH Document Management System

Contents TeleForm Bookings... 5 TeleForm System Overview... 7 TeleForm Designer... 9 Create a Template Save a Template Adding pages for

case study NZZ Neue Zürcher Zeitung AG Archive 1780 Summary Introductory Overview ORGANIZATION: PROJECT NAME:

UTAX (UK) Ltd. Solutions Awareness

Mac OS X 10 Using the Keyboard Viewer and Character Palette

OneTouch 4.0 with OmniPage OCR Features. Mini Guide

UForge 3.4 Release Notes

Android Tutorial. Larry Walters OOSE Fall 2011

Microsoft product screenshots are reprinted with permission from Microsoft Corporation.

Automatic License Plate Recognition using Python and OpenCV

Element Payment Services is a third party software system that processes credit card and automatic bank account (ACH) transactions through PestPac.

Running a Program on an AVD

Guidelines for the submission of invoices

Third-party software is copyrighted and licensed from Kofax s suppliers.

A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques

I. Delivery Flash CMS template package II. Flash CMS template installation III. Control Panel setup... 5

Whitepaper Document Solutions

ABBYY FineReader 11 Corporate Edition

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

LEXMARK DOCUMENT SOLUTIONS SUITE 3.3.8

ONE PLATFORM FOR ALL YOUR PRINT, SCAN, AND DEVICE MANAGEMENT

What s New in QuarkXPress 8

A Tutorial on dynamic networks. By Clement Levallois, Erasmus University Rotterdam

One Platform for all your Print, Scan and Device Management

HTML5 AUTOMOTIVE 2013 White Paper Series

Scanning for OCR Text Conversion

Generate Android App

Scalable and sustainable OCR & document image analysis in the cloud

About This Document 3. Integration and Automation Capabilities 4. Command-Line Interface (CLI) 8. API RPC Protocol 9.

Product Library v.1.1 EUR Release Notes. DVD Contents. January 10th, Windows. Windows. Windows 8. Server 2008 Server 2008 R2.

Going Paperless The Utah Experience. Mike Pecorelli Project Manager Utah DEQ

Table of Contents 2. Table of Contents

Using Google Drive. Using Google Drive. Information Security Requirements

Product Library 2.5 EUR. DVD Contents. Release Notes January 31st, Windows 2000 Windows Server Windows. Windows Vista. Windows.

HTML5. Turn this page to see Quick Guide of CTTC

Archiving digital documents and s in PDF/A

Right-to-Left Language Support in EMu

NUANCE The experience speaks for itself

Product Library v.2.0eur Release Notes. DVD Contents. October 8th, Windows Server 2008 Server 2008 R2. Windows 2000 Windows

WHY DIGITAL ASSET MANAGEMENT? WHY ISLANDORA?

Version 3.0 May P Xerox Mobile Print Cloud User How To and Troubleshooting Guide

OCRopus Addons. Internship Report. Submitted to:

1.0 Getting Started Guide

SNOW LICENSE MANAGER (7.X)... 3

Turnitin Blackboard 9.0 Integration Instructor User Manual

Cloud Portal for imagerunner ADVANCE

Safeguard Ecommerce Integration / API

eggplant for Cross Platform Test Automation TestPlant Nick Saunders

HP Service Manager. Software Version: 9.40 For the supported Windows and Linux operating systems. Knowledge Management help topics for printing

NaviCell Data Visualization Python API

Google Drive and more Image Capture Networked MFPs

Creating Next-Generation User Experience with Windows Aero, Windows Presentation Foundation and Silverlight on Windows Embedded Standard 7

Introduction to Digital Workflow Ticketing

ABBYY FineReader 9.0 Corporate Edition System Administrator s Guide

National Institute of Standards and Technology. HIPAA Security Rule Toolkit. User Guide

Generate Android App with Protection and Licensing

Oracle Universal Content Management

OCR and PDF Compression

Litigation Support connector installation and integration guide for Summation

Transcription:

Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 1 / 20 Module 6 Other OCR engines: ABBYY, Tesseract Uwe Springmann Centrum für Informations- und Sprachverarbeitung (CIS) Ludwig-Maximilians-Universität München (LMU) 2015-09-14

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 2 / 20 ABBYY

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 3 / 20 ABBYY: Overview Russian company with leading OCR products: FineReader (desktop product or CLI; not suitable for our purposes) FineReader Engine SDK (Windows, Mac, Linux) Recognition Server (Windows) Cloud OCR SDK an excellent comparison of the different products at www.succeed-project.eu with hints helping you choose the right one state-of-the-art binarization and document analysis (zoning without semantics) was partner in IMPACT project can recognize Fraktur (Gothic script) and uses lexica with historical spellings (not in FineReader desktop or CLI) you can use your own lexica (API) limited capability for glyph training many output formats (e.g. txt, pdf, xml) format supported by PoCoTo: ABBYY XML (not in FineReader desktop)

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 4 / 20 ABBYY invocation for all SDK versions (Recognition Server, Engine, Cloud service) you will need to script your commands in the next practice module (m7), you will use the Cloud service with a provided script example: recognize a directory with page images of a historical German book with FineReader Engine SDK 11 for Linux both text and xml output (with character confidences) use historical lexicon recognize Gothic typeface for i in *.tif; do done /opt/fre11.1/samples/commandlineinterface/cli \ -if $i -tet UTF8 \ -f Text -f XML --xmlwriteasciicharattributes \ -of../ ${i/.tif/.abbyy.txt} -of../ ${i/.tif/.abbyy.xml} \ -rl OldGerman -rtt Gothic

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 5 / 20 ABBYY example: Die Grenzboten (1841) project done at Staats- und Universitätsbibliothek Bremen (Manfred Nölte) digitization support (zoning, OCR, correction) by BBAW, Berlin (Geyken, Bönig, Haaf, Jurish, Thomas, Wiegand, Würzner)

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 6 / 20 ABBYY xml output box coordinates (left, top, right, bottom) <charparams l="1552" t="1199" r="1584" b="1269"> </charparams> <charparams l="1585" t="1199" r="1649" b="1269" charconfidence="85">v</charparams> <charparams l="1655" t="1217" r="1679" b="1265" charconfidence="100">e</charparams> <charparams l="1683" t="1217" r="1713" b="1267" charconfidence="39">r</charparams> <charparams l="1715" t="1199" r="1751" b="1285" charconfidence="25">h</charparams> <charparams l="1757" t="1199" r="1791" b="1267" charconfidence="100">ä</charparams> <charparams l="1795" t="1201" r="1821" b="1267" charconfidence="100">l</charparams> <charparams l="1823" t="1209" r="1843" b="1267" charconfidence="34">t</charparams> <charparams l="1845" t="1219" r="1883" b="1267" charconfidence="40">n</charparams> <charparams l="1885" t="1199" r="1905" b="1265" charconfidence="56">i</charparams> <charparams l="1909" t="1199" r="1961" b="1283" charconfidence="87">s</charparams> <charparams l="1909" t="1199" r="1961" b="1283" charconfidence="87">s</charparams> <charparams l="1953" t="1219" r="1979" b="1267" charconfidence="100">e</charparams> <charparams l="1981" t="1249" r="1999" b="1269" charconfidence="57">.</charparams> <charparams l="2000" t="1199" r="2090" b="1277"> </charparams>

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 7 / 20 Searchable pdf output formats: searchable pdf (87kB), text and image (⒋3kB)

ABBYY Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 8 / 20 ABBYY assessment undisputed leader in industrial space used in large-scale projects (newspapers, libraries) good preprocessing: binarization, document analysis use for 19th c. and later (even Fraktur) single gyphs can be trained, but not complete typesets closed source, cannot be adapted/trained outside of company results on early printings currently unsatisfactory

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 9 / 20 Tesseract

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 10 / 20 Tesseract: History developed by Ray Smith: 1984 PhD project sponsored by HP 1988 developed for HP scanner products 1994 project cancelled 1995 UNLV evaluation (among 3 best products) 2005 open sourced by HP since 2006: taken on by Google layout analysis, 39 languages continuous development and improvement used internally (but not exclusively) for Google books Ray Smith s tutorial slides (2014) give fascinating background and insights some documentation

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 11 / 20 Tesseract invocation from the command line: tesseract <imagefile> <outputbase> -l LANG example: tesseract 0001.tif 0001 -l deu-frak the output file will be 0001.txt hocr format: tesseract 0001.tif 0001 -l deu-frak html searchable pdf: tesseract 0001.tif 0001 -l deu-frak pdf you may also use a GUI (not all options are available)

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 12 / 20 Tesseract hocr output format either text (txt) or hocr, an html-format with embedded segmentation info hocr is a valid PoCoTo input format bounding box: x0y0 x1y1 <span class= ocrx_word id= word_1_33 title= bbox 1584 1199 1997 1284; \ x_wconf 87 lang= deu-frak dir= ltr >Verhältnisse.</span> hocr has word tokens (separated by white space) as smallest unit Ben Kiessling (Nidaba project), Kay Würzner have achieved character xml output (ABBYY-like)

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 13 / 20 Searchable pdf

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 14 / 20 Tesseract Training when to train: enable recognition of a new language rather, what is trained are glyph shapes: new alphabet (Latin, Cyrillic, Greek etc.) new typeface (Schriftart: Antiqua, Fraktur) new font (Schriftsatz: special instance of a typeface, e.g. 12 pt Caslon italic) optionally add language data (wordlists) better recognition of special glyphs (e.g. recognize long s as ſ, not s) training data are in files of the form LANG.traineddata glyph shape training data and language support data (wordlists) are tied up in the same file language data can be exchanged without retraining (better and larger wordlists)

Tesseract Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 15 / 20 Tesseract assessment in every respect not as good as ABBYY but a fascinating tool for experiments and research: open source can be trained and adapted then almost as good as ABBYY in recognition (not in preprocessing, language detection, output formats etc.) many people provide training data (Nick White: ancient Greek!) can it be used for large scale projects? recognition: yes but OCR is a whole workflow of many steps (preprocessing!) needs to be supplemented by other open source tools makes it more complicated to build and monitor, but is possible provides an independent OCR result (output of both engines can be combined for error reduction)

Evaluation Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 16 / 20 Evaluation

Evaluation Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 17 / 20 Evaluation (101 pages of Die Grenzboten) use the UNLV/ISRI toolkit (see Module 0: Software) works on text pages (compared to ground truth): ocrevalutf8 accuracy gt-file OCR-file ocrevalutf8 wordacc gt-file OCR-file combined report for many files: accsum, wordaccsum use vote for combining the output of several engines (>2) evaluation of first 101 pages of Grenzboten, mean values no postcorrection, no training voted has 1926 wrong characters, 984 wrong words engine character acc. word acc. lexicon ABBYY FRE11 9⒏86 9⒍48 OldGerman Tesseract ⒊03 9⒎78 9⒊11 German OCRopus 0.7 9⒏17 9⒈60 none voted 99.28 97.59

Evaluation Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 18 / 20 Comparison of OCR engines ABBYY: 268702 Characters 3062 Errors 98.86% Accuracy Tesseract: 268702 Characters 5954 Errors 97.78% Accuracy Errors Marked Corr.-Gener. 271 0 {e}-{c} 264 0 {<\n>}-{} 244 0 {<\n><\n>}-{} 79 0 {}-{} 55 0 { }-{ } 54 0 {en}-{m} 54 0 {s}-{s} 48 0 {m}-{n:} 46 0 {R}-{N} Errors Marked Correct-Generated 372 0 {}-{-} 246 0 {ü}-{ii} 228 0 {}-{ } 215 0 {I}-{J} 211 0 {v}-{v} 208 0 {}-{<\n>} 175 0 {u}-{n} 140 0 {n}-{u} 136 0 {c}-{e}

Evaluation Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 19 / 20 Remaining errors after voting 268702 Characters 1926 Errors 99.28% Accuracy Errors Marked Correct-Generated 315 0 {<\n>}-{} 154 0 {<\n><\n>}-{} 54 0 {}-{} 50 0 { }-{ } 43 0 { }-{} 39 0 {I}-{J} 34 0 { }-{»} 32 0 {}-{ } 30 0 {}-{.} 30 0 { }-{,,} 29 0 {u}-{n} 27 0 {n}-{u} among the most frequent errors, 637 are related to missing blank lines or different punctuation glyphs (99,52% acc. for text) the remaining errors have a long tail room for improvement: engines can be trained, better lexica, postcorrection

Evaluation Uwe Springmann Module 6 Other OCR engines: ABBYY, Tesseract 2015-09-14 20 / 20 Homework register for a free ABBYY developer account: register with name of your app (make up a project name) enter a Cloud OCR SDK promo code promo code: (ask instructor) ⒈000 pages valid until 15 of January, 2016 (thanks to Michael Fuchs of ABBYY Deutschland) install Tesseract for your OS (see Module 0: Software)